AML-3204 Social Media Analytics

Project Title: Collaborative Filtering-based vs Hybrid Recommender System

Group Members

1. Importing Packages

This project compares a collaborative filtering-based recommender with a hybrid recommender system. The collaborative filtering-based recommender is built with the Matrix Factorization technique, while the hybrid recommender uses the Embedding layer from the PyTorch package.

2. Data

3. Tweets from Twitter

Given the u.item dataset, extract a tagword from each movie name. Then search Twitter with that tagword to download tweets related to the movie, and repeat for every movie. Clean the tweets, calculate a sentiment score for each tweet, and average the scores per movie; the average is saved as that movie's sentiment score.

3.1 Extracting the tagword from the movie name
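The tagword extraction can be sketched as follows, assuming u.item titles of the form "Toy Story (1995)"; the helper name `extract_tagword` is ours, not the project's exact code:

```python
import re

def extract_tagword(movie_name):
    """Strip the release year and punctuation from a u.item movie title,
    then join the remaining words into a single hashtag-style tagword."""
    name = re.sub(r"\(\d{4}\)", "", movie_name)   # drop the "(1995)" year suffix
    name = re.sub(r"[^A-Za-z0-9 ]", "", name)     # keep only letters, digits, spaces
    return "".join(name.split())                  # join words into one token

print(extract_tagword("Toy Story (1995)"))  # ToyStory
```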

3.2 Extracting tweets

To extract the information, we used the snscrape package. We built a class called "Scraper" that contains the necessary functions to do the scraping. All the extracted information is stored in CSV files so it can later be loaded into a pandas DataFrame.
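A minimal sketch of what such a Scraper class can look like with snscrape; the method names, query format, and CSV layout are illustrative assumptions, not the project's exact code:

```python
import csv
import itertools

class Scraper:
    """Sketch of a tweet scraper built on snscrape's TwitterSearchScraper."""

    def __init__(self, out_path):
        self.out_path = out_path

    def build_query(self, tagword, since="2020-01-01", lang="en"):
        # Twitter advanced-search query built from the movie's tagword
        return f"#{tagword} since:{since} lang:{lang}"

    def scrape(self, tagword, limit=100):
        # snscrape is imported lazily so the rest of the class works without it
        import snscrape.modules.twitter as sntwitter
        scraper = sntwitter.TwitterSearchScraper(self.build_query(tagword))
        rows = [(t.date, t.content)
                for t in itertools.islice(scraper.get_items(), limit)]
        with open(self.out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["date", "text"])
            writer.writerows(rows)
```

The CSV produced per movie can then be read back with `pandas.read_csv` for cleaning and scoring.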

https://drive.google.com/file/d/1Q_dhdirnxfNlVuFXgQOXKUQyTta45ALQ/view?usp=share_link

3.3 Data cleaning

We created a class called Cleaner which has the necessary methods to clean the text:
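A sketch of what the Cleaner class's methods can look like; the exact cleaning steps used in the project may differ:

```python
import re

class Cleaner:
    """Removes URLs, mentions, and hashtags, then normalizes the text."""

    def remove_urls(self, text):
        return re.sub(r"https?://\S+", "", text)

    def remove_mentions_hashtags(self, text):
        return re.sub(r"[@#]\w+", "", text)

    def normalize(self, text):
        text = text.lower()
        text = re.sub(r"[^a-z0-9 ]", " ", text)    # strip punctuation and emoji
        return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

    def clean(self, text):
        return self.normalize(self.remove_mentions_hashtags(self.remove_urls(text)))

print(Cleaner().clean("Loved #ToyStory!! @pixar https://t.co/xyz"))  # loved
```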

3.4 Sentiment Analysis with vaderSentiment

In this section we perform sentiment analysis on the text of each tweet. A positive, neutral, or negative sentiment will be assigned to each tweet. The vaderSentiment library is used to calculate the sentiment; it returns a value between -1 (most negative) and 1 (most positive).

4. Matrix Factorization

In a recommendation engine such as Netflix or MovieLens, there is a user base and a catalogue of products (movies, for those two systems). Given that every user has rated some products in the system, we want to estimate how they would rate the products they have not yet rated, so that we can make suggestions. All the known ratings can be represented as a matrix.

The rationale for using matrix factorization here is that there should be some latent features that determine how a user rates an item. For instance, two users might both give a certain movie a high rating because they share a preference for action films, or because they like its actors. If we can uncover these latent features, we should be able to predict a rating for a given user and item, since the features associated with the user should correspond to those associated with the item. We also assume that the number of latent features is smaller than the number of users and the number of items. This assumption is reasonable: if every user were tied to his or her own unique feature, recommendations would be pointless, because no user would be interested in the items rated by other users. The same argument applies to the items.

Python implementation

Math behind Matrix Factorization

Firstly, we have a set U of users and a set D of items. Let R of size |U| × |D| be the matrix that contains all the ratings that the users have assigned to the items. Also, assume that we would like to discover K latent features. Our task, then, is to find two matrices P (a |U| × K matrix) and Q (a |D| × K matrix) such that their product approximates R:

$$\mathbf{R} \approx \mathbf{P} \times \mathbf{Q}^T = \hat{\mathbf{R}}$$

We now need to figure out how to obtain P and Q. One method is to initialise the two matrices with some values, determine how "different" their product is from R, and then try to reduce this difference iteratively. This approach, known as gradient descent, seeks a local minimum of the difference. The difference, usually called the error between the estimated rating and the true rating, can be computed for each user-item pair with the following equation:

$$e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = \Big(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\Big)^2$$

Knowing in which direction to change the values of $p_{ik}$ and $q_{kj}$ will help us reduce the error. That means we need the gradient at the current values, so we differentiate the above equation with respect to these two variables separately:

$$\frac{\partial}{\partial p_{ik}} e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij}) q_{kj} = -2 e_{ij} q_{kj}$$

$$\frac{\partial}{\partial q_{kj}} e_{ij}^2 = -2(r_{ij} - \hat{r}_{ij}) p_{ik} = -2 e_{ij} p_{ik}$$

Having obtained the gradient, we can now formulate the update rules for both $p_{ik}$ and $q_{kj}$:

$$p'_{ik} = p_{ik} - \alpha \frac{\partial}{\partial p_{ik}} e_{ij}^2 = p_{ik} + 2\alpha e_{ij} q_{kj}$$

$$q'_{kj} = q_{kj} - \alpha \frac{\partial}{\partial q_{kj}} e_{ij}^2 = q_{kj} + 2\alpha e_{ij} p_{ik}$$

Here, α is a constant whose value determines the rate of approaching the minimum. Usually we choose a small value for α, say 0.0002, because too large a step towards the minimum risks overshooting it and oscillating around it. Using the update rules above, the process is repeated until the error converges to its minimum. The following equation computes the overall error and helps us decide when to stop:

$$E = \sum_{(u_i, d_j, r_{ij}) \in T} e_{ij}^2 = \sum_{(u_i, d_j, r_{ij}) \in T} \Big(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\Big)^2$$

where T is the set of known ratings.

A common extension of this basic approach is regularization, which helps prevent overfitting. To do this, we introduce a parameter β and modify the squared error as follows:

$$e_{ij}^2 = \Big(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\Big)^2 + \frac{\beta}{2} \sum_{k=1}^{K} \big(p_{ik}^2 + q_{kj}^2\big)$$
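Put together, the regularized update rules can be implemented as a plain SGD loop in NumPy. This is a minimal, unvectorised sketch over a dense toy matrix (0 marks a missing rating), not the project's exact code:

```python
import numpy as np

def matrix_factorization(R, K=2, steps=5000, alpha=0.0002, beta=0.02):
    """Factorize R ~ P @ Q.T via SGD using the regularized update rules."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    P = rng.random((n_users, K))
    Q = rng.random((n_items, K))
    for _ in range(steps):
        for i in range(n_users):
            for j in range(n_items):
                if R[i, j] > 0:  # only observed ratings contribute
                    e_ij = R[i, j] - P[i] @ Q[j]
                    P[i] += alpha * (2 * e_ij * Q[j] - beta * P[i])
                    Q[j] += alpha * (2 * e_ij * P[i] - beta * Q[j])
        # total squared error over observed entries; stop once converged
        err = sum((R[i, j] - P[i] @ Q[j]) ** 2
                  for i in range(n_users) for j in range(n_items) if R[i, j] > 0)
        if err < 0.001:
            break
    return P, Q

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
P, Q = matrix_factorization(R)
print(np.round(P @ Q.T, 2))
```

The reconstructed matrix `P @ Q.T` approximates the observed entries of `R` and fills in predictions for the zeros.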

5. Hybrid Recommender System

The aim is to build a deep learning model for a hybrid recommendation system. Embedding layers (essentially layers that map an index to a vector of trainable weights) will do much of the work. The neural network that we implemented is the following:

NNDiagram.png
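Based on the description above, such a network can be sketched in PyTorch as follows. The layer sizes and the feature layout (a genre one-hot vector plus one sentiment value per example) are illustrative assumptions, not the exact architecture in the diagram:

```python
import torch
import torch.nn as nn

class HybridRecommender(nn.Module):
    def __init__(self, n_users, n_movies, n_genres, emb_size=100):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_size)    # user index -> latent vector
        self.movie_emb = nn.Embedding(n_movies, emb_size)  # movie index -> latent vector
        # two embeddings + genre one-hot + 1 sentiment score feed an MLP
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_size + n_genres + 1, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_ids, movie_ids, genres, sentiment):
        x = torch.cat([self.user_emb(user_ids),
                       self.movie_emb(movie_ids),
                       genres,
                       sentiment.unsqueeze(1)], dim=1)
        return self.mlp(x).squeeze(1)  # one predicted rating per example

model = HybridRecommender(n_users=943, n_movies=1682, n_genres=19)
pred = model(torch.tensor([1, 2]), torch.tensor([10, 20]),
             torch.zeros(2, 19), torch.tensor([0.3, -0.1]))
print(pred.shape)  # torch.Size([2])
```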

5.1 Data Preprocessing

What we need is a dataframe that contains all the features of the movies: the genre, the ratings from the users, and the sentiment score that we obtained from the tweets. For that, we merge the movies_info, mean_sentiments, and ratings dataframes.

After dropping all the unnecessary features from the dataframes, we obtained the following dataframe.

Then we split the data into train and validation sets, using 70% for training and 30% for validation.
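The 70/30 split can be done with pandas alone, for example:

```python
import pandas as pd

def train_val_split(df, frac=0.7, seed=42):
    """Shuffle and split a dataframe into train (frac) and validation (1 - frac)."""
    train = df.sample(frac=frac, random_state=seed)  # random 70% of the rows
    val = df.drop(train.index)                       # the remaining 30%
    return train, val

ratings = pd.DataFrame({"user_id": range(10),
                        "rating": [4, 3, 5, 2, 4, 1, 5, 3, 2, 4]})
train, val = train_val_split(ratings)
print(len(train), len(val))  # 7 3
```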

5.2 Building the Neural Network

Building the architecture of the neural network

There are some hyperparameters to tune to get a better model, such as the latent vector size and the number of epochs. To choose them, we iterated over different latent vector sizes and compared the errors.

After the experiment, we concluded that the best vector size is 100

Now, the same process for the epochs.
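The epoch search follows the same pattern as the vector-size search: train for a given number of epochs and record the validation error. A toy illustration of the loop, with a small least-squares model standing in for the actual network:

```python
import numpy as np

# synthetic regression data in place of the real ratings
rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_train = X_train @ true_w + rng.normal(scale=0.1, size=80)
y_val = X_val @ true_w + rng.normal(scale=0.1, size=20)

def val_mse_after(n_epochs, lr=0.05):
    """Train with full-batch gradient descent and report validation MSE."""
    w = np.zeros(3)
    for _ in range(n_epochs):
        grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
        w -= lr * grad
    return np.mean((X_val @ w - y_val) ** 2)

errors = {n: val_mse_after(n) for n in [1, 5, 20, 100]}
best = min(errors, key=errors.get)  # epoch count with the lowest validation MSE
print(best)
```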

6. Results

6.1 Vector size vs MSE for Matrix Factorization recommender system

6.2 Vector size vs MSE for Hybrid Recommender System

6.3 MSE for both recommender systems

6.4 Evaluation metrics

First, we extract the top 10 and top 20 movies from the real dataset for user_id 2. With that, we can compare the results of both recommender systems and calculate some evaluation metrics to measure the performance of each system.
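Extracting a user's top-N movies from the ratings dataframe can be sketched like this (the column names are assumed to match the merged dataframe):

```python
import pandas as pd

ratings = pd.DataFrame({
    "user_id":  [2, 2, 2, 2, 1],
    "movie_id": [10, 20, 30, 40, 10],
    "rating":   [5, 3, 4, 2, 1],
})

def top_n_movies(df, user_id, n):
    """Movies a given user rated, sorted by rating, top n."""
    user = df[df["user_id"] == user_id]
    return user.sort_values("rating", ascending=False).head(n)["movie_id"].tolist()

print(top_n_movies(ratings, 2, 2))  # [10, 30]
```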

The predicted ratings for the top 20 movies for user_id 2 using the hybrid recommender system are the following:

The predicted ratings for the top 20 movies for user_id 2 using the Matrix Factorization recommender system are the following:

Now we can calculate some evaluation metrics such as RMSE, precision, recall, and F1-score.

MSE, RMSE

The mean squared error is the mean of the squared differences between the original and predicted values in the data set. It measures the variance of the residuals.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The root mean squared error is the square root of the mean squared error. It measures the standard deviation of the residuals.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
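Both metrics translate directly to NumPy:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    """Square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

print(mse([3, 4, 5], [3, 3, 5]))   # ~0.333
print(rmse([3, 4, 5], [3, 3, 5]))  # ~0.577
```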

Precision

We can define precision as the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Intuitively, the precision of a classifier is its capacity not to label a negative sample as positive.

1 is the best value, while 0 is the worst.

Using the 'micro' parameter, the function calculates metrics globally by counting the total true positives, false negatives, and false positives. When true positives + false positives == 0, the function returns 0 and raises an UndefinedMetricWarning. This behaviour can be modified with the zero_division parameter.

Recall

Recall is the classifier's capacity to find all the positive samples. We can define recall as the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives.

1 is the best value, while 0 is the worst.

F1-Score

The F1 score can be thought of as the harmonic mean of precision and recall, with the best value being 1 and the worst being 0. Precision and recall contribute equally to the F1 score. We can calculate it with the formula: F1 = 2 × (precision × recall) / (precision + recall).
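All three metrics can be computed directly from the raw counts; a zero_division-style fallback of 0 is used when a denominator is empty:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, ~0.667, ~0.727)
```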